Abstract


Nonparametric two sample or homogeneity testing is a decision theoretic problem 
that involves identifying differences between two random variables without making 
parametric assumptions about their underlying distributions. The literature is old and 
rich, with a wide variety of statistics having been intelligently designed and analyzed,
both for the unidimensional and the multivariate setting. Our contribution is to tie 
together many of these tests, drawing connections between seemingly very different 
statistics. In this work, our central object is the Wasserstein distance, as we form 
a chain of connections from univariate methods like the Kolmogorov-Smirnov test, 
PP/QQ plots and ROC/ODC curves, to multivariate tests involving energy statistics 
and kernel based maximum mean discrepancy. Some connections proceed through the 
construction of a smoothed Wasserstein distance, and others through the pursuit of a 
“distribution-free” Wasserstein test. Some observations in this chain are implicit in the 
literature, while others seem to have not been noticed thus far. Given nonparametric 
two sample testing’s classical and continued importance, we aim to provide useful 
connections for theorists and practitioners familiar with one subset of methods but 
not others. 

1 Introduction 

Nonparametric two sample testing (or homogeneity testing) deals with detecting differences
between two d-dimensional distributions, given samples from both, without making
any parametric distributional assumptions. The popular tests for d = 1 are rather different
from those for d > 1, and our interest is in tying together different tests used in both
settings. There is a massive literature on the two-sample problem, which has been formally
studied for nearly a century, and there is no way we can cover the breadth of this huge and
historic body of work. Our aim is much more restricted: we wish to study this problem
through the eyes of the beautiful Wasserstein distance. We wish to form connections
between several seemingly distinct families of such tests, both intuitively and formally,
in the hope of informing both practitioners and theorists who may have familiarity with
some sets of tests, but not others. We will also only introduce related work that has a
direct relationship with this paper.

There are also a large number of tests for parametric two-sample testing (assuming
a form for underlying distributions, like Gaussianity), and yet others for testing only
differences in means of distributions (like Hotelling’s t-test, Wilcoxon’s signed rank test,
Mood’s median test). Our focus will be much more restricted: in this paper, we will
restrict our attention only to nonparametric tests for testing differences in (any moment
of the underlying) distribution.

Our paper started as an attempt to understand testing with the Wasserstein distance
(also called earth-mover’s distance or transportation distance). The main prior work in
this area involves studying the “trimmed” comparison of distributions by Munk and Czado
[1998] and Freitag et al. [2007], with applications to biostatistics, specifically population
bioequivalence, and later by Alvarez-Esteban et al. [2008, 2012]. Apart from two-sample
testing, the study of univariate goodness-of-fit testing (or one-sample testing) was undertaken
in del Barrio et al. [1999, 2000, 2005], and summarized extremely well in del Barrio [2004].
There are other semiparametric works specific to goodness-of-fit testing for location-scale
families that we do not mention here, since they diverge from our interest in fully
nonparametric two-sample testing for generic distributions.

In this paper, we uncover an interesting relationship between the multivariate Wasserstein
test and the (Euclidean) Energy distance test, also called the Cramer test, proposed
independently by Szekely and Rizzo [2004] and Baringhaus and Franz [2004]. This proceeds
through the construction of a smoothed Wasserstein distance, by adding an entropic
penalty/regularization: varying the weight of the regularization interpolates between
the Wasserstein distance at one extreme and the Energy distance at the other extreme.
This also gives rise to a new connection between the univariate Wasserstein test and
popular univariate data analysis tools like quantile-quantile (QQ) plots and the
Cramer-von Mises (CvM) test. Due to the relationship between distances and kernels, we will also
establish connections to the kernel-based multivariate test by Gretton et al. [2012] called
the Maximum Mean Discrepancy, or MMD. Finally, the desire to design a univariate
distribution-free Wasserstein test will lead us to the formal study of Receiver Operating
Characteristic (ROC) curves, relating to work by Hsieh et al. [1996].


Intuitively, the underlying reasons for the similarities and differences between the
above tests can be seen through two lenses. First is the population viewpoint of how
different tests work with different representations of distributions; most of these tests are
based on differences between quantities that completely specify a distribution: (a) cumulative
distribution functions (CDFs), (b) quantile functions (QFs), and (c) characteristic
functions (CFs). Second is the sample viewpoint of the behavior these statistics show
under the null hypothesis; most of these tests have null distributions based on norms of
Brownian bridges, alternatively viewed as infinite sums of weighted chi-squared distributions
(due to the Karhunen-Loeve expansion).

While we connect a wide variety of popular and seemingly disparate families of tests,
there are still further classes of tests that we do not have space to discuss. Some examples
of tests quite different from the ones studied here include rank based tests as
covered by the excellent book Lehmann and D’Abrera [2006], and graphical tests that
include spanning tree methods by Friedman and Rafsky [1979] (generalizing the runs test by
Wald and Wolfowitz [1940]), nearest-neighbor based tests by Schilling [1986] and Henze
[1988], and the cross-match tests by Rosenbaum [2005]. The book by Thas [2010] is also
a very useful reference.


Paper Outline and Contributions. The rest of this paper proceeds as follows. In
Section 2 we formally present the notation and setup of nonparametric two sample testing,
as well as briefly introduce three different ways of comparing distributions: using CDFs,
QFs and CFs. In Section 3 we form a novel connection between the multivariate Wasserstein
distance, the multivariate Energy Distance, and the kernel MMD, through an
entropy-smoothed Wasserstein distance. In Section 4 we relate the univariate Wasserstein
two-sample test to PP and QQ plots/tests. Lastly, in Section 5 we will design a univariate
Wasserstein test statistic that is also “distribution-free” unlike its classical counterpart,
providing a careful and rigorous analysis of its limiting distribution by connecting it to
ROC/ODC curves.


2 Nonparametric Two Sample Testing 

More formally, we are given i.i.d. samples $X_1, \ldots, X_n \sim P$ and $Y_1, \ldots, Y_m \sim Q$,
where $P$ and $Q$ are probability measures on $\mathbb{R}^d$. We denote by $P_n$ and $Q_m$ the
corresponding empirical measures. A test $\eta$ is a function from the data
$D_{m,n} := \{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}$ to $\{0,1\}$ (or to $[0,1]$ if it is a randomized test).

Most tests proceed by calculating a scalar test statistic $T_{m,n} := T(D_{m,n}) \in \mathbb{R}$ and
deciding $H_0 : P = Q$ or $H_1 : P \neq Q$ depending on whether $T_{m,n}$, after suitable normalization,
is smaller or larger than a threshold $t_\alpha$. The threshold $t_\alpha$ is calculated based on a
prespecified false positive rate $\alpha$, chosen so that $\mathbb{E}_{H_0} \eta \le \alpha$, at least
asymptotically. Indeed, all tests considered in this paper are of the form

$$\eta(X_1, \ldots, X_n, Y_1, \ldots, Y_m) = \mathbb{I}(T_{m,n} > t_\alpha).$$

We follow the Neyman-Pearson paradigm, where a test is judged by its power
$\phi = \phi(m, n, d, P, Q, \alpha)$. We say that a test $\eta$ is consistent, in the classical sense, when

$$\phi \to 1 \quad \text{as } m, n \to \infty, \ \alpha \to 0.$$

All the tests we consider in this paper will be consistent in the classical sense mentioned
above. Establishing general conditions under which these tests are consistent in the
high-dimensional setting is largely open. All the test statistics considered here are
typically small under $H_0$ and large under $H_1$ (usually with appropriate
scaling, they converge to zero and to infinity respectively with infinite samples). The
aforementioned threshold $t_\alpha$ will be determined by the distribution of the test statistic
being used under the null hypothesis (i.e. assuming the null was true, we would like to
know the typical variation of the statistic, and we reject the null if our observation is
far from what is typically expected under the null). This naturally leads us to study the
null distribution of our test statistic, i.e. the distribution of our statistic under the null
hypothesis. Since these are crucial to running and understanding the corresponding tests,
we will pursue their description in some detail in this paper.
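
The thresholding recipe above can be sketched in code. The following is a minimal illustration, not a procedure taken from this paper: it calibrates the threshold by permutation (one of the Monte-Carlo techniques mentioned later in the text) rather than by an asymptotic null distribution. The names `permutation_test` and `mean_difference` are hypothetical, and the mean-difference statistic is only a toy stand-in for $T_{m,n}$ that detects location shifts.

```python
import random

def mean_difference(xs, ys):
    """Toy statistic T(D): absolute difference of sample means (illustrative only)."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def permutation_test(xs, ys, statistic, alpha=0.05, num_permutations=999, seed=0):
    """Return (decision, p_value); decision is 1 when H0 is rejected at level alpha."""
    rng = random.Random(seed)
    observed = statistic(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(num_permutations):
        # Under H0 the labels are exchangeable, so reshuffling them
        # simulates a draw from the null distribution of the statistic.
        rng.shuffle(pooled)
        if statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    p_value = (exceed + 1) / (num_permutations + 1)
    return int(p_value <= alpha), p_value

xs = [0.1 * i for i in range(30)]
ys = [5.0 + 0.1 * i for i in range(30)]
decision, p_value = permutation_test(xs, ys, mean_difference)
print(decision, p_value)
```

Any of the statistics discussed in later sections could be passed in place of `mean_difference`; the permutation wrapper does not depend on the choice.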


2.1 Three Ways to Compare Distributions 


The literature broadly has three dominant ways of comparing distributions, both in one 
and in multiple dimensions. These are based on three different ways of characterizing 
distributions — cumulative distribution functions (CDFs), characteristic functions (CFs) 
and quantile functions (QFs). Many of the tests we will consider involve calculating 
differences between (empirical estimates of) these quantities. 


For example, it is well known that the Kolmogorov-Smirnov (KS) test by Kolmogorov
[1933] and Smirnov [1948] involves differences in empirical CDFs. We shall later see that
in one dimension, the Wasserstein distance calculates differences in QFs.

The KS test, the related Cramer-von Mises criterion by Cramer [1928] and Von Mises
[1928], and the Anderson-Darling test by Anderson and Darling [1952] are very popular in
one dimension, but their usage has been more restricted in higher dimensions. This is
mostly due to the curse of dimensionality involved with estimating multivariate empirical
CDFs. While there has been work on generalizing these popular one-dimensional tests to higher
dimensions, like Bickel [1969], these are seemingly not the most common multivariate tests.

Two classes of tests that are actually quite popular are kernel and distance based
tests. As we will recap in more detail in later sections, it is known that the Gaussian
kernel MMD implicitly calculates a (weighted) difference in CFs and the Euclidean energy
distance implicitly works with a difference in (projected) CDFs.


3 Entropy Smoothed Wasserstein Distances 


The theory of optimal transport (see Villani [2009]) provides a set of powerful tools to
compare probability measures and distributions on $\mathbb{R}^d$ through the knowledge of a metric
on $\mathbb{R}^d$, which we assume to be the usual Euclidean metric in what follows. Among that
set of tools, the following family of p-Wasserstein distances between probability measures
is the best known.


3.1 Wasserstein Distance 

Given an exponent $p \ge 1$, the definition of the p-Wasserstein distance reads:

Definition 1 (Wasserstein Distances). For $p \in [1,\infty)$ and Borel probability measures $P, Q$
on $\mathbb{R}^d$ with finite p-moments, their p-Wasserstein distance [Villani, 2009, Sect. 6] is

$$W_p(P,Q) = \left( \inf_{\pi \in \Gamma(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|X - Y\|^p \, d\pi \right)^{1/p}, \qquad (1)$$

where $\Gamma(P,Q)$ is the set of all joint probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are
$P, Q$, i.e. such that for all subsets $A \subset \mathbb{R}^d$ we have $\pi(A \times \mathbb{R}^d) = P(A)$ and
$\pi(\mathbb{R}^d \times A) = Q(A)$.

A remarkable feature of Wasserstein distances is that Definition 1 applies to all measures
regardless of their absolute continuity with respect to the Lebesgue measure: the
same definition works for both empirical measures and for their densities if they exist.

Writing $1_n$ for the n-dimensional vector of ones, when comparing two empirical measures
with uniform^1 weight vectors $1_n/n$ and $1_m/m$, the Wasserstein distance $W_p(P_n, Q_m)$
exponentiated to the power $p$ is the optimum of a network flow problem known as the
transportation problem [Bertsimas and Tsitsiklis, 1997, Section 7.2]. This problem has
a linear objective and a polyhedral feasible set, defined respectively through the matrix
$M_{XY}$ of pairwise distances between elements of X and Y raised to the power p,

$$M_{XY} := \left[ \|X_i - Y_j\|^p \right]_{ij} \in \mathbb{R}^{n \times m}, \qquad (2)$$

and the polytope $U_{nm}$ defined as the set of $n \times m$ nonnegative matrices such that their
row and column marginals are equal to $1_n/n$ and $1_m/m$ respectively:

$$U_{nm} := \{ T \in \mathbb{R}_+^{n \times m} : T 1_m = 1_n/n, \ T^T 1_n = 1_m/m \}. \qquad (3)$$

Let $\langle A, B \rangle := \mathrm{trace}(A^T B)$ be the usual Frobenius dot-product of matrices. Combining
Eq. (1) and (3), we have that $W_p^p(P_n, Q_m)$ is the optimum of a linear program of $n \times m$
variables,

$$W_p^p(P_n, Q_m) = \min_{T \in U_{nm}} \langle T, M_{XY} \rangle, \qquad (4)$$

of feasible set $U_{nm}$ and cost matrix $M_{XY}$.
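
To make the linear program (4) concrete, here is a minimal sketch for the special case n = m. It relies on the standard fact (Birkhoff-von Neumann) that for equal sample sizes and uniform weights an optimum of (4) is attained at a permutation matrix scaled by 1/n, so a tiny instance can be solved by enumerating assignments. The function name `wasserstein_pp_bruteforce` is hypothetical, and the factorial-time enumeration is only meant for illustration on very small samples.

```python
import math
from itertools import permutations

def wasserstein_pp_bruteforce(xs, ys, p=1):
    """W_p^p(P_n, Q_n) for two samples of equal size n, by enumerating assignments."""
    n = len(xs)
    if len(ys) != n:
        raise ValueError("this sketch only handles equal sample sizes")
    # M_XY: pairwise Euclidean distances raised to the power p, as in Eq. (2).
    cost = [[math.dist(x, y) ** p for y in ys] for x in xs]
    # Each permutation sigma corresponds to the vertex T with T[i][sigma(i)] = 1/n.
    return min(sum(cost[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n))) / n

xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 1.0), (1.0, 1.0)]
print(wasserstein_pp_bruteforce(xs, ys, p=1))
```

For realistic sample sizes one would use a dedicated assignment or network-flow solver instead of enumeration.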

We finish this section by pointing out that the rate of convergence as $n, m \to \infty$ of
$W_p(P_n, Q_m)$ towards $W_p(P, Q)$ gets slower as the dimension d grows under mild assumptions.
For simplicity of exposition consider m = n. For any $p \in [1,\infty)$, it follows from
Dudley [1968] that for $d \ge 3$, the difference between $W_p(P_n, Q_n)$ and $W_p(P, Q)$ scales as
$n^{-1/d}$. We also point out that when d = 2 the rate actually scales as $\sqrt{\ln n}/\sqrt{n}$
(see Ajtai et al. [1984]). Finally, we note that when considering $p = \infty$ the rates of
convergence are different from those when $1 \le p < \infty$. The work of Leighton and Shor [1989],
Shor and Yukich [1991], Garcia and Slepcev [2015] shows that the rate of convergence of
$W_\infty(P_n, Q_n)$ towards $W_\infty(P, Q)$ is of order $(\ln n / n)^{1/d}$ when $d \ge 3$ and
$(\ln n)^{3/4} / n^{1/2}$ when d = 2. Hence, the
original Wasserstein distance by itself may not be a favorable choice for a multivariate two
sample test.


^1 The Wasserstein machinery works also for non-uniform weights. We do not mention this in this paper
because all of the measures we consider in the context of two-sample testing are uniform.

3.2 Smoothed Wasserstein Distance

Aside from the slow convergence rate of the Wasserstein distance between samples from
two different measures to their distance in population, computing the optimum of (4) is
expensive. This can be easily seen by noticing that the transportation problem boils down
to an optimal assignment problem when n = m. Since the resolution of the latter has a
cubic cost in n, all known algorithms that can solve the optimal transport problem scale at
least super-cubically in n. Using an idea that can be traced back as far as Schrodinger [1931],
Cuturi [2013] recently proposed to use an entropic regularization of the optimal transport
problem, to define the Sinkhorn divergence between P, Q parameterized by $\lambda \ge 0$ as

$$S_p^\lambda(P,Q) := \min_{T \in U_{nm}} \lambda \langle T, M_{XY} \rangle - E(T), \qquad (5)$$

where $E(T)$ is the entropy of T seen as a discrete joint probability distribution, namely
$E(T) := -\sum_{ij} T_{ij} \log(T_{ij})$. Let $T_\lambda$ be the minimizer of the above smoothed optimal
transport problem.

This approach has two benefits: (i) because the negative entropy $-E(T)$ is 1-strongly convex
with respect to the $\ell_1$ norm, the regularized problem is itself strongly convex and admits a unique
optimal solution $T_\lambda$, as opposed to the initial OT problem, for which the minimizer may
not be unique; (ii) this optimal solution $T_\lambda$ is a diagonal scaling of $e^{-\lambda M_{XY}}$, the
element-wise exponential matrix of $-\lambda M_{XY}$. One can easily show using the Lagrange method of
multipliers that there must exist two non-negative vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$ such that
$T_\lambda := D_u e^{-\lambda M_{XY}} D_v$, where $D_u$, $D_v$ are diagonal matrices with u and v on their diagonal.

The solution to this diagonal scaling problem can be found efficiently through Sinkhorn’s
algorithm [Sinkhorn, 1967], which has a linear convergence rate [Franklin and Lorenz,
1989]. Sinkhorn’s algorithm can be implemented in a few lines of code that only require
matrix vector products and elementary operations, and is hence easily parallelized on modern
hardware.
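
As an illustration of how short such an implementation can be, the following sketch alternately rescales the rows and columns of the element-wise exponential of $-\lambda M_{XY}$ until both marginal constraints of $U_{nm}$ hold. It is a plain-Python sketch under simplifying assumptions (a fixed iteration count instead of a convergence check, no numerical safeguards for large $\lambda$); the names `sinkhorn_plan` and `transport_cost` are hypothetical.

```python
import math

def sinkhorn_plan(cost, lam, num_iters=500):
    """Approximate minimizer of lam * <T, cost> - E(T) over U_nm (uniform marginals)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-lam * c) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Rescale rows so that each row of diag(u) K diag(v) sums to 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that each column of diag(u) K diag(v) sums to 1/m.
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Frobenius dot-product <T, M>."""
    return sum(t * c for prow, crow in zip(plan, cost) for t, c in zip(prow, crow))

xs = [0.0, 1.0, 2.0]
ys = [0.5, 1.5, 2.5, 3.5]
cost = [[abs(x - y) for y in ys] for x in xs]   # p = 1, univariate for brevity
for lam in (0.0, 2.0):
    print(lam, transport_cost(sinkhorn_plan(cost, lam), cost))
```

With `lam = 0.0` the returned table is the outer product of the two uniform marginals, which is the case used in the next subsection.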


3.3 Smoothing the Wasserstein Distance to Energy Distance 

An interesting class of tests are distance-based “energy statistics” as introduced in parallel
by Baringhaus and Franz [2004] and Szekely and Rizzo [2004]. The statistic, called the
Cramer statistic by the former paper and Energy Distance by the latter, corresponds to
the population quantity

$$\mathrm{ED} := 2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|,$$

where $X, X' \sim P$ and $Y, Y' \sim Q$ (all i.i.d.). An associated test statistic can be calculated
as

$$\mathrm{ED}_b := \frac{2}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \|X_i - Y_j\| - \frac{1}{n^2} \sum_{i,j=1}^{n} \|X_i - X_j\| - \frac{1}{m^2} \sum_{i,j=1}^{m} \|Y_i - Y_j\|.$$

Remarkably, $\mathrm{ED}(P, Q) = 0$ iff $P = Q$. Hence, rejecting when $\mathrm{ED}_b$ is larger than an
appropriate threshold leads to a test which is consistent against all fixed alternatives where
$P \neq Q$ under mild conditions (like finiteness of $\mathbb{E}[X], \mathbb{E}[Y]$); see aforementioned references
for details. Then, the Sinkhorn divergence defined in (5) can be linked to the energy
distance when the regularization parameter is set to $\lambda = 0$, through the following formula:

$$\mathrm{ED}_b = 2 S_1^0(P_n, Q_m) - S_1^0(P_n, P_n) - S_1^0(Q_m, Q_m). \qquad (6)$$

Indeed, notice first that $T_0$ is the maximal entropy table in $U_{nm}$, namely the outer product
$(1_n 1_m^T)/nm$ of the marginals $1_n/n$ and $1_m/m$. Then (6) follows from evaluating the
transport cost $\langle T_0, M_{XY} \rangle$ of this table, which gives

$$S_1^0(P_n, Q_m) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \|X_i - Y_j\|, \quad S_1^0(P_n, P_n) = \frac{1}{n^2} \sum_{i,j=1}^{n} \|X_i - X_j\|, \quad S_1^0(Q_m, Q_m) = \frac{1}{m^2} \sum_{i,j=1}^{m} \|Y_i - Y_j\|.$$
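
The identity relating the energy statistic to the $\lambda = 0$ case can be checked numerically. The sketch below computes the transport cost of the outer-product table directly as an average over all pairs, and assembles the energy statistic from three such averages. It is illustrative only; `mean_pairwise_distance` and `energy_statistic` are hypothetical names, and points are assumed to be given as tuples of coordinates.

```python
import math

def mean_pairwise_distance(us, vs):
    """<T_0, M> with T_0 the outer product of uniform marginals and p = 1."""
    return sum(math.dist(u, v) for u in us for v in vs) / (len(us) * len(vs))

def energy_statistic(xs, ys):
    """2 S(P_n, Q_m) - S(P_n, P_n) - S(Q_m, Q_m), each S evaluated at lambda = 0."""
    return (2 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))

xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 1.0), (1.0, 1.0)]
print(energy_statistic(xs, ys))
```

Because each term is an average over all ordered pairs (including i = j), this is the biased, V-statistic form of the estimator.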
3.4 From Energy Distance to Kernel Maximum Mean Discrepancy

Another popular class of tests that has emerged over the last decade is kernel-based
tests, introduced independently by Gretton et al. [2006] and Fernandez et al. [2008], and
expanded on in Gretton et al. [2012]. Without getting into technicalities that are irrelevant
for this paper, the Maximum Mean Discrepancy between P, Q is defined as

$$\mathrm{MMD}(\mathcal{H}_k, P, Q) := \max_{\|f\|_{\mathcal{H}_k} \le 1} \mathbb{E}_P f(X) - \mathbb{E}_Q f(Y),$$

where $\mathcal{H}_k$ is a Reproducing Kernel Hilbert Space associated with Mercer kernel $k(\cdot,\cdot)$ and
$\|f\|_{\mathcal{H}_k} \le 1$ is its unit norm ball. While it is easy to see that $\mathrm{MMD} \ge 0$ always, and also
that $P = Q$ implies $\mathrm{MMD} = 0$, Gretton et al. [2006] show that if k is “characteristic”, the
equality holds iff $P = Q$ (the Gaussian kernel $k(a,b) = \exp(-\|a - b\|^2/\gamma^2)$ is a popular
example). Using the Riesz representation theorem and the reproducing property of $\mathcal{H}_k$,
one can argue that $\mathrm{MMD}(\mathcal{H}_k, P, Q) = \|\mathbb{E}_P k(X,\cdot) - \mathbb{E}_Q k(Y,\cdot)\|_{\mathcal{H}_k}$, and hence using the
reproducing property again, one can conclude that

$$\mathrm{MMD}^2 = \mathbb{E} k(X, X') + \mathbb{E} k(Y, Y') - 2\,\mathbb{E} k(X, Y).$$

This gives rise to a natural associated test statistic, a plugin estimator of $\mathrm{MMD}^2$:

$$\widehat{\mathrm{MMD}}^2(k(\cdot,\cdot)) := \frac{1}{n^2} \sum_{i,j=1}^{n} k(X_i, X_j) + \frac{1}{m^2} \sum_{i,j=1}^{m} k(Y_i, Y_j) - \frac{2}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} k(X_i, Y_j).$$

Apart from the fact that $\mathrm{MMD}(P, Q) = 0$ iff $P = Q$, the other fact that makes this a useful
test statistic is that its estimation error, i.e. the error of $\widehat{\mathrm{MMD}}^2$ in estimating $\mathrm{MMD}^2$,
scales like $\sqrt{(m+n)/(mn)}$, independent of d.^2 See Gretton et al. [2012] for a detailed proof.

At first sight, the Energy Distance and the MMD look like fairly different tests.
However, there is a natural connection that proceeds in two steps. Firstly, there is no
reason to stick to only the Euclidean norm $\|\cdot\|_2$ to measure distances for ED: the
test can be extended to other norms, and in fact also other metrics; Lyons [2013] explains
the details for the closely related independence testing problem. Following that,
Sejdinovic et al. [2013] discuss the relationship between distances and kernels (again for
independence testing, but the same arguments hold in the two sample testing setting also).
Loosely speaking, for every kernel k, there exists a metric d (and also vice versa), given
by $d(x, y) := (k(x, x) + k(y, y))/2 - k(x, y)$, such that MMD with kernel k equals ED with
metric d. This is a very strong connection between these two families of tests.
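
The correspondence between kernels and distances can be verified numerically on the plug-in statistics: substituting $d(x,y) = (k(x,x) + k(y,y))/2 - k(x,y)$ into $2\,\mathbb{E} d(X,Y) - \mathbb{E} d(X,X') - \mathbb{E} d(Y,Y')$ makes the diagonal terms cancel and leaves $\mathbb{E} k(X,X') + \mathbb{E} k(Y,Y') - 2\,\mathbb{E} k(X,Y)$. The sketch below checks this for univariate data with a Gaussian kernel; all function names are hypothetical and the bandwidth `gamma=1.0` is an arbitrary choice made for the example.

```python
import math

def gaussian_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-|a - b|^2 / gamma^2) for scalars a, b."""
    return math.exp(-((a - b) ** 2) / gamma ** 2)

def induced_distance(a, b, k=gaussian_kernel):
    """d(x, y) = (k(x, x) + k(y, y)) / 2 - k(x, y), as in the text."""
    return (k(a, a) + k(b, b)) / 2 - k(a, b)

def mean_over_pairs(f, us, vs):
    """Average of f(u, v) over all pairs, i.e. an expectation under empirical measures."""
    return sum(f(u, v) for u in us for v in vs) / (len(us) * len(vs))

def mmd_squared(xs, ys, k=gaussian_kernel):
    """Plug-in estimate of E k(X,X') + E k(Y,Y') - 2 E k(X,Y)."""
    return (mean_over_pairs(k, xs, xs) + mean_over_pairs(k, ys, ys)
            - 2 * mean_over_pairs(k, xs, ys))

def energy_with_metric(xs, ys, d=induced_distance):
    """Plug-in estimate of 2 E d(X,Y) - E d(X,X') - E d(Y,Y') for a general d."""
    return (2 * mean_over_pairs(d, xs, ys) - mean_over_pairs(d, xs, xs)
            - mean_over_pairs(d, ys, ys))

xs = [0.0, 0.3, 1.2]
ys = [0.5, 2.0]
print(mmd_squared(xs, ys), energy_with_metric(xs, ys))
```

The two printed values agree up to floating-point rounding, for any kernel passed as `k`.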


^2 This is unlike KL-divergence, which is also zero iff P = Q, but is in general hard to estimate in high
dimensions.

4 Wasserstein Distance and PP or QQ tests

For univariate random variables, a PP plot is a graphical way to view differences in
empirical CDFs, while QQ plots are the analogous tool for comparing QFs. Instead of relying on
graphs, we can also make such tests more formal and rigorous as follows. We first present
some results on the asymptotic distribution of the difference between $P_n$ and $Q_m$ when
using the distance between the CDFs $F_n$ and $G_m$ and then later when using the distance
between the QFs $F_n^{-1}$ and $G_m^{-1}$. For simplicity we assume that both distributions P and Q
are supported on the interval [0,1]; we remark that under mild assumptions on P and Q,
the results we present in this section still hold without such a boundedness assumption.
Moreover we assume for simplicity that the CDFs F and G have positive densities on
[0,1].


4.1 Comparing CDFs (PP) 

We start by noting that $F_n$ may be interpreted as a random element taking values in the space
$D([0,1])$ of right continuous functions with left limits. It is well known that

$$\sqrt{n}\,(F_n - F) \to_w \mathbb{B} \circ F, \qquad (7)$$

where $\mathbb{B}$ is a standard Brownian bridge in [0,1] and where the weak convergence $\to_w$ is
understood as convergence of probability measures in the space $D([0,1])$; see Chapter 3
in Billingsley [1968] for details. From this fact and the independence of the samples, it
follows that under the null hypothesis $H_0 : P = Q$, as $n, m \to \infty$,

$$\sqrt{\frac{mn}{m+n}}\,(F_n - G_m) = \sqrt{\frac{mn}{m+n}}\,(F_n - F) + \sqrt{\frac{mn}{m+n}}\,(G - G_m) \to_w \mathbb{B} \circ F. \qquad (8)$$

The previous fact, and continuity of the function $h \in D([0,1]) \mapsto \int_0^1 (h(t))^2 dt$, imply
that as $n, m \to \infty$, we have under the null,

$$\frac{mn}{m+n} \int_0^1 (F_n(t) - G_m(t))^2 \, dt \to_w \int_0^1 (\mathbb{B}(F(t)))^2 \, dt.$$

Observe that the above asymptotic null distribution depends on F which is unknown in
practice. This is an obstacle^3 when considering any $L^p$-distance, with $1 \le p < \infty$, between
the empirical cdfs $F_n$ and $G_m$. Luckily, a different situation occurs when one considers
the $L^\infty$-distance between $F_n$ and $G_m$. Under the null, using again (7) we deduce that

$$\sqrt{\frac{mn}{m+n}}\,\|F_n - G_m\|_\infty \to_w \|\mathbb{B} \circ F\|_\infty = \|\mathbb{B}\|_\infty, \qquad (9)$$

where the equality in the previous expression follows from the fact that the continuity of
F implies that the interval [0,1] is mapped onto the interval [0,1]. This test statistic, the
so-called Kolmogorov-Smirnov test statistic, is hence appropriate for two sample problems.
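
A minimal sketch of the two-sample Kolmogorov-Smirnov statistic follows. It uses the fact that $F_n - G_m$ is a step function that only changes at data points, so the supremum can be found by scanning the pooled sample. The names `ecdf` and `ks_statistic` are hypothetical. The final lines illustrate the distribution-free property in finite samples: applying the same strictly increasing map to both samples leaves the statistic unchanged.

```python
import bisect
import math

def ecdf(sample):
    """Return the empirical CDF t -> (# points <= t) / n as a function."""
    ordered = sorted(sample)
    return lambda t: bisect.bisect_right(ordered, t) / len(ordered)

def ks_statistic(xs, ys):
    """sqrt(mn/(m+n)) * sup_t |F_n(t) - G_m(t)|, with the sup taken over data points."""
    f_n, g_m = ecdf(xs), ecdf(ys)
    n, m = len(xs), len(ys)
    sup_diff = max(abs(f_n(t) - g_m(t)) for t in list(xs) + list(ys))
    return math.sqrt(m * n / (m + n)) * sup_diff

xs = [0.1, 0.4, 0.7, 1.5]
ys = [0.2, 0.3, 0.9]
print(ks_statistic(xs, ys))
# Same value after a strictly increasing transformation of both samples.
print(ks_statistic([math.exp(x) for x in xs], [math.exp(y) for y in ys]))
```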

^3 This obstacle can be avoided in the goodness-of-fit testing context, when we want to test if $X_1, \ldots, X_n$
was drawn from a known CDF F or not. This is the original purpose of the $L^2$-statistics of the von Mises
type. Briefly, from (7), and since the function $f \in D([0,1]) \mapsto \int_0^1 (f(t))^2 dF(t)$ is continuous, we deduce

$$n \int_0^1 (F_n(t) - F(t))^2 \, dF(t) \to_w \int_0^1 (\mathbb{B}(F(t)))^2 \, dF(t) = \int_0^1 (\mathbb{B}(s))^2 \, ds,$$

where the last equality follows by a change of variables, leading to an expression that does not depend
on F.


4.2 Comparing QFs (QQ) 

We now turn our attention to QQ (quantile-quantile) plots and specifically the $L^2$-distance
between $F_n^{-1}$ and $G_m^{-1}$. It can be shown that if F has a differentiable density f which (for
the sake of simplicity) we assume is bounded away from zero, then

$$\sqrt{n}\,(F_n^{-1} - F^{-1}) \to_w \frac{\mathbb{B}}{f \circ F^{-1}}.$$

For a proof of the above statement see Chapter 18 in Shorack and Wellner [1986]; for an
alternative proof where the weak convergence is considered in the space of probability
measures on $L^2((0,1))$ (as opposed to the space $D([0,1])$ we have been considering thus
far) see del Barrio [2004].

We note that from the previous result and independence, it follows that under the null
hypothesis $H_0 : P = Q$,

$$\sqrt{\frac{mn}{m+n}}\,(F_n^{-1} - G_m^{-1}) \to_w \frac{\mathbb{B}}{f \circ F^{-1}}.$$

In particular by continuity of the function $h \in L^2((0,1)) \mapsto \int_0^1 (h(t))^2 dt$, we deduce that

$$\frac{mn}{m+n} \int_0^1 (F_n^{-1}(t) - G_m^{-1}(t))^2 \, dt \to_w \int_0^1 \frac{(\mathbb{B}(t))^2}{(f \circ F^{-1}(t))^2} \, dt.$$

Hence, as was the case when we considered the difference of the cdfs $F_n$ and $G_m$, the
asymptotic distribution of the $L^2$-difference (or analogously any $L^p$-difference for finite p)
of the empirical quantile functions is also distribution dependent. Note however that there
is an important difference between QQ and PP plots when using the $L^\infty$ norm. We saw
that the asymptotic distribution of the $L^\infty$ norm of the difference of $F_n$ and $G_m$ is (under
the null hypothesis) distribution free. Unfortunately, in the quantile case, we obtain

$$\sqrt{\frac{mn}{m+n}}\,\|F_n^{-1} - G_m^{-1}\|_\infty \to_w \left\| \frac{\mathbb{B}}{f \circ F^{-1}} \right\|_\infty,$$

which of course is distribution dependent. Since one would have to resort to
computer-intensive Monte-Carlo techniques (like bootstrap or permutation testing) to control type-1
error, these tests are sometimes overlooked (though with modern computing speeds, they
merit further study).


4.3 Wasserstein is a QQ test 

We recall that in general, for $p \in [1,\infty)$ the p-Wasserstein distance between two probability
measures P, Q on $\mathbb{R}$ with finite p-moments is given by

$$W_p(P,Q) := \inf_{\pi \in \Gamma(P,Q)} \left( \int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\pi(x,y) \right)^{1/p}. \qquad (10)$$

Because the Wasserstein distance measures the cost of transporting mass from the
original distribution P into the target distribution Q, one can say that it measures
“horizontal” discrepancies between P and Q. Intuitively, two probability distributions P and Q
that are different over “long” (horizontal) regions will be far away from each other in the
Wasserstein distance sense, because in that case mass has to travel long distances to go
from the original distribution to the target distribution. In the one dimensional case (in
contrast with what happens in dimension d > 1), the p-Wasserstein distance has a simple
interpretation in terms of the quantile functions $F^{-1}$ and $G^{-1}$ of P and Q respectively.
The reason for this is that the optimal way to transport mass from P to Q has to satisfy
a certain monotonicity property which we describe in the proof of the following Proposition.
This is a well known fact that can be found, for example, in Thas [2010]. Nevertheless,
we present its proof in the Appendix for the sake of completeness.

Proposition 1. The p-Wasserstein distance between two probability measures P and Q
on $\mathbb{R}$ with finite p-moments can be written as

$$W_p^p(P,Q) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|^p \, dt,$$

where $F^{-1}$ and $G^{-1}$ are the quantile functions of P and Q respectively.
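
Applied to empirical measures, the quantile formula gives a simple algorithm: both empirical quantile functions are step functions, so the integral is a finite sum over the intervals on which both are constant. The sketch below handles unequal sample sizes by merging the two grids $\{i/n\}$ and $\{j/m\}$; the name `wasserstein_pp` is hypothetical. When n = m it reduces to averaging $|X_{(i)} - Y_{(i)}|^p$ over the sorted samples.

```python
def wasserstein_pp(xs, ys, p=2):
    """W_p^p between two univariate empirical measures via their quantile functions."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    t = 0.0       # current position in [0, 1]
    total = 0.0
    while i < n and j < m:
        # On (t, next_t] the quantile functions equal xs[i] and ys[j] respectively.
        a, b = (i + 1) / n, (j + 1) / m
        next_t = min(a, b)
        total += (next_t - t) * abs(xs[i] - ys[j]) ** p
        t = next_t
        if a == next_t:
            i += 1
        if b == next_t:
            j += 1
    return total

print(wasserstein_pp([0.0, 1.0], [2.0, 3.0], p=1))
print(wasserstein_pp([0.0, 2.0], [1.0], p=2))
```

After sorting, the cost is linear in n + m, in contrast with the super-cubic cost of the general transportation problem.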

Having considered the p-Wasserstein distance $W_p(P,Q)$ for $p \in [1,\infty)$ in the one
dimensional case, we conclude this section by considering the case $p = \infty$. Let P, Q be
two probability measures on $\mathbb{R}$ with bounded support. That is, assume that there exists
a number $N > 0$ such that $\mathrm{supp}(P) \subseteq [-N, N]$ and $\mathrm{supp}(Q) \subseteq [-N, N]$. We define the
$\infty$-Wasserstein distance between P and Q by

$$W_\infty(P,Q) := \inf_{\pi \in \Gamma(P,Q)} \operatorname{esssup}_\pi |x - y|.$$

Proceeding as in the case $p \in [1,\infty)$, it is possible to show that the $\infty$-Wasserstein distance
between P and Q with bounded supports can be written in terms of the difference of the
corresponding quantile functions as

$$W_\infty(P,Q) = \|F^{-1} - G^{-1}\|_\infty.$$

The Wasserstein distance is also sometimes called the Kantorovich-Rubinstein metric
and the Mallows distance in the statistical literature, where it has been studied extensively
due to its ability to capture weak convergence precisely: $W_p(F_n, F)$ converges to 0 if
and only if $F_n$ converges in distribution to F and also the p-th moment of X under $F_n$
converges to the corresponding moment under F; see Dobrushin [1970], Mallows [1972],
Bickel and Freedman [1981].


5 A Distribution-Free Wasserstein Test 

As we earlier saw, under the null hypothesis $H_0 : P = Q$, the statistic
$\int_0^1 (F_n^{-1}(t) - G_m^{-1}(t))^2 \, dt$
has an asymptotic distribution which is not distribution free, i.e., it depends on F. We
also saw that as opposed to what happens with the asymptotic distribution of the $L^\infty$
distance between $F_n$ and $G_m$, the asymptotic distribution of $\|F_n^{-1} - G_m^{-1}\|_\infty$ does depend
on the cdf F.

In this section we show how we can construct a distribution-free Wasserstein test. To
prove that it is distribution-free, we connect it to the theory of ROC and ODC curves.


5.1 Relating Wasserstein Distance to ROC and ODC curves

Let P and Q be two distributions on $\mathbb{R}$ with cdfs F and G and quantile functions $F^{-1}$
and $G^{-1}$ respectively. We define the ROC curve between F and G as the function

$$\mathrm{ROC}(t) := 1 - F(G^{-1}(1-t)), \quad t \in [0,1].$$

In addition, we define their ODC curve by

$$\mathrm{ODC}(t) := G(F^{-1}(t)), \quad t \in [0,1].$$

The following are straightforward properties of the ROC curve (see Hsieh and Turnbull
[1996]).

1. The ROC curve is increasing and $\mathrm{ROC}(0) = 0$, $\mathrm{ROC}(1) = 1$.

2. If $G(t) \ge F(t)$ for all t then $\mathrm{ROC}(t) \ge t$ for all t.

3. If F and G have densities with monotone likelihood ratio, then the ROC curve is
concave.

4. The area under the ROC curve is equal to $P(Y < X)$, where $Y \sim Q$ and $X \sim P$.


Intuitively speaking, the faster the ROC curve increases towards the value 1, the easier it
is to distinguish the distributions P and Q. Observe from their definitions that the ROC
curve can be obtained from the ODC curve after reversing the axes. Given this, from this
point on we focus on only one of them, the ODC curve being more convenient.

The first observation about the ODC curve is that it can be regarded as the quantile
function of the distribution $G_\# P$ (the push forward of P by G) on [0,1] which is defined
by

$$G_\# P([0,\alpha)) := P(G^{-1}([0,\alpha))), \quad \alpha \in [0,1].$$

Similarly, we can consider the measure $G_{m\#} P_n$, that is, the push forward of $P_n$ by $G_m$. We
crucially note that the empirical ODC curve $G_m \circ F_n^{-1}$ is the quantile function of $G_{m\#} P_n$.
From Section 4 we deduce that

$$W_p^p(G_{m\#} P_n, G_\# P) = \int_0^1 |G_m \circ F_n^{-1}(t) - G \circ F^{-1}(t)|^p \, dt$$

for every $p \in [1,\infty)$ and also

$$W_\infty(G_{m\#} P_n, G_\# P) = \|G_m \circ F_n^{-1} - G \circ F^{-1}\|_\infty.$$

That is, the p-Wasserstein distance between the measures $G_{m\#} P_n$ and $G_\# P$ can be computed
by considering the distance of the ODC curve and its empirical version.

First we argue that under the null hypothesis $H_0 : P = Q$, the distribution of the empirical
ODC curve is actually independent of P. In particular, $W_p(G_{m\#} P_n, G_\# P)$ and
$W_\infty(G_{m\#} P_n, G_\# P)$ are distribution free under the null! This is the content of the next
lemma, proved in the Appendix.

Lemma 1 (Reduction to uniform distribution). Let F, G be two continuous and strictly
increasing CDFs and let $X_1, \ldots, X_n \sim F$ and $Y_1, \ldots, Y_m \sim G$ be two independent samples.
We let $F_n$ and $G_m$ be the CDFs associated to the empirical distributions induced by the Xs
and the Ys respectively. Consider the (unknown) random variables, which are distributed
uniformly on [0,1],

$$U_k^X := F(X_k), \quad U_k^Y := G(Y_k).$$

Let $F_n^U$ be the empirical CDF associated to the (uniform) $U^X$s and let $G_m^U$ be the empirical
CDF associated to the (uniform) $U^Y$s. Then, under the null hypothesis $H_0 : F = G$ we
have

$$G_m(X_k) = G_m^U(U_k^X), \quad \forall k \in \{1, \ldots, n\}.$$

In particular,

$$G_m \circ F_n^{-1}(t) = G_m^U \circ (F_n^U)^{-1}(t), \quad \forall t \in [0,1].$$

Note that since $U^X$, $U^Y$ are obviously instantiations of uniformly distributed random
variables, the RHS of the last equation only involves uniform r.v.s and hence, the distribution
of $G_m \circ F_n^{-1}$ is independent of F, G under the null. Now we are almost done: the
above lemma will imply that the Wasserstein distance between $G_m \circ F_n^{-1}$ and the uniform
distribution U[0,1] (since $G \circ F^{-1}(t) = t = U^{-1}(t) = U(t)$ for $t \in [0,1]$ when G = F) also
does not depend on F, G.

More formally, we establish a result related to the asymptotic distribution of
$W_p(G_m \sharp P_n, G \sharp P)$ and $W_\infty(G_m \sharp P_n, G \sharp P)$. We do this by first considering the
asymptotic distribution of the difference between the empirical ODC curve and the
population ODC curve, regarding both of them as elements of the space $\mathcal{D}([0,1])$. This
is the content of the following theorem, which follows directly from the work of Komlos
et al. (see also Hsieh and Turnbull, 1996).


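Theorem 1. Suppose that F and G are two CDFs with densities f, g satisfying

$$\frac{g(F^{-1}(t))}{f(F^{-1}(t))} \le C$$

for all $t \in [0,1]$. Also, assume that $n/m \to \lambda \in [0, \infty)$ as $n, m \to \infty$. Then,

$$\sqrt{\frac{mn}{m+n}} \left( G_m(F_n^{-1}(\cdot)) - G(F^{-1}(\cdot)) \right) \to_w \sqrt{\frac{\lambda}{\lambda+1}} \, B_1(G \circ F^{-1}(\cdot)) + \sqrt{\frac{1}{\lambda+1}} \, \frac{g(F^{-1}(\cdot))}{f(F^{-1}(\cdot))} \, B_2(\cdot),$$

where $B_1$ and $B_2$ are two independent Brownian bridges and where the weak convergence
must be interpreted as weak convergence in the space of probability measures on the space
$\mathcal{D}([0,1])$.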

As a corollary, under the null hypothesis $H_0 : P = Q$ we obtain the following. Suppose
that the CDF F of P is continuous and strictly increasing. Then,

$$\frac{mn}{m+n} \, W_2^2(G_m \sharp P_n, G \sharp P) = \frac{mn}{m+n} \int_0^1 \left( G_m(F_n^{-1}(t)) - t \right)^2 dt \to_w \int_0^1 (B(t))^2 \, dt$$

and

$$\sqrt{\frac{mn}{m+n}} \, W_\infty(G_m \sharp P_n, G \sharp P) = \sqrt{\frac{mn}{m+n}} \, \sup_{t \in [0,1]} \left| G_m(F_n^{-1}(t)) - t \right| \to_w \sup_{t \in [0,1]} |B(t)|,$$

where $B$ is a Brownian bridge.



To see this, note that by Lemma 1 it suffices to consider $F(t) = t$ in [0,1]. In that case,
the assumptions of Theorem 1 are satisfied and the result follows directly.
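
The step from Theorem 1 to the two null limits can be spelled out as follows. This is a sketch, assuming the limit in Theorem 1 is the combination of two independent Brownian bridges $B_1$, $B_2$ written on the first line below; with $F(t) = t$ and $G = F$ the density ratio is 1 and $G \circ F^{-1}(t) = t$.

```latex
% Null case G = F with F(t) = t on [0,1]: g/f = 1 and G \circ F^{-1}(t) = t.
\[
  B(t) := \sqrt{\tfrac{\lambda}{\lambda+1}}\, B_1(t)
        + \sqrt{\tfrac{1}{\lambda+1}}\, B_2(t).
\]
% B is a centered Gaussian process and, by independence of B_1 and B_2,
\[
  \operatorname{Cov}\big(B(s), B(t)\big)
  = \Big( \tfrac{\lambda}{\lambda+1} + \tfrac{1}{\lambda+1} \Big)
    \big( \min(s,t) - st \big)
  = \min(s,t) - st ,
\]
% which is the covariance of a Brownian bridge. Applying the continuous
% mapping theorem to x \mapsto \int_0^1 x(t)^2 \, dt and to
% x \mapsto \sup_{t \in [0,1]} |x(t)| then yields the two stated limits.
```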


The takeaway message of this section is that instead of considering the Wasserstein
distance between $F_n$ and $G_m$, whose null distribution depends on the unknown F, one can
instead consider the Wasserstein distance between $G_m \circ F_n^{-1}$ and the uniform
distribution U[0,1], since its null distribution is independent of F; that is, we have
constructed a distribution-free test.
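
Because the null distribution does not depend on F, it can be tabulated once by simulating uniform samples. The sketch below illustrates this for the sup-type statistic; it is not the paper's implementation, and the names `sup_odc_statistic`, `simulated_null` and `p_value` are hypothetical. Continuous data without ties are assumed.

```python
import math
import random
from bisect import bisect_left, bisect_right


def sup_odc_statistic(xs, ys):
    """sqrt(mn/(m+n)) * sup over t in [0,1] of |G_m(F_n^{-1}(t)) - t|."""
    n, m = len(xs), len(ys)
    ys_sorted = sorted(ys)
    sup = 0.0
    for k, x in enumerate(sorted(xs), start=1):
        # The empirical ODC curve is constant, equal to c, on ((k-1)/n, k/n],
        # so |c - t| is largest at one of the two endpoints.
        c = bisect_right(ys_sorted, x) / m
        sup = max(sup, abs(c - (k - 1) / n), abs(c - k / n))
    return math.sqrt(m * n / (m + n)) * sup


def simulated_null(n, m, reps=2000, seed=0):
    """Sorted Monte Carlo draws of the statistic under the null.

    Uniform samples suffice because the statistic is distribution-free.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        ux = [rng.random() for _ in range(n)]
        uy = [rng.random() for _ in range(m)]
        draws.append(sup_odc_statistic(ux, uy))
    return sorted(draws)


def p_value(xs, ys, null_stats):
    """Monte Carlo p-value: share of null draws at least as large as observed."""
    t = sup_odc_statistic(xs, ys)
    exceed = len(null_stats) - bisect_left(null_stats, t)
    return (1 + exceed) / (1 + len(null_stats))
```

The same table of simulated values can be reused for any pair of samples with the same sizes n and m, whatever the underlying continuous distribution.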


6 Conclusion 

In this paper, we connect a wide variety of univariate and multivariate test statistics, 
with the central piece being the Wasserstein distance. The Wasserstein statistic is closely
related to univariate tests like the Kolmogorov-Smirnov test and to graphical tools like
QQ plots, and a distribution-free variant of the test is obtained by connecting it to
ROC/ODC curves.
Through entropic smoothing, the Wasserstein test is also related to the multivariate tests 
of Energy Distance and hence transitively to the Kernel Maximum Mean Discrepancy. 

We hope that this is a useful resource to connect the seemingly vastly different families
of two sample tests, many of which can be analyzed under the two umbrellas of our paper:
whether they differentiate between CDFs, QFs or CFs, and what their null distributions
look like. A comprehensive empirical survey is also of interest but out of our current scope.

A Proof of Proposition 1


Proof: We first observe that the infimum in the definition of $W_p(P,Q)$ can be replaced
by a minimum; namely, there exists a transportation plan $\pi \in \Gamma(P,Q)$ that achieves the
infimum in (10). This can be deduced in a straightforward way by noting that the
expression $\int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\pi(x,y)$ is linear in $\pi$ and that the set $\Gamma(P,Q)$ is compact in
the sense of weak convergence of probability measures on $\mathbb{R} \times \mathbb{R}$. Let us denote by $\pi^*$ an
element in $\Gamma(P,Q)$ realizing the minimum in (10). Let $(x_1, y_1) \in \mathrm{supp}(\pi^*)$ and
$(x_2, y_2) \in \mathrm{supp}(\pi^*)$ (here $\mathrm{supp}(\pi^*)$ stands for the support of $\pi^*$) and suppose that
$x_1 < x_2$. We claim that the optimality of $\pi^*$ implies that $y_1 \le y_2$. To see this, suppose for
the sake of contradiction that this is not the case, that is, suppose that $y_2 < y_1$. We claim
that in that case

$$|x_1 - y_2|^p + |x_2 - y_1|^p < |x_1 - y_1|^p + |x_2 - y_2|^p. \qquad (11)$$

Note that for p = 1 this follows in a straightforward way. For the case p > 1, first note
that $x_1 < x_2$ and $y_2 < y_1$ imply that there exists $t \in (0,1)$ such that
$t x_1 + (1-t) y_1 = t x_2 + (1-t) y_2$. Now, note that

$$|x_1 - y_2| = |x_1 - (t x_1 + (1-t) y_1)| + |(t x_1 + (1-t) y_1) - y_2|$$

because the points $x_1$, $y_2$ and $t x_1 + (1-t) y_1$ all lie on the same line segment. But then,
using the fact that $t x_1 + (1-t) y_1 = t x_2 + (1-t) y_2$, we can rewrite the previous expression
as

$$|x_1 - y_2| = (1-t) |x_1 - y_1| + t |y_2 - x_2|.$$

Using the strict convexity of the function $t \mapsto t^p$ (when p > 1), we deduce that

$$|x_1 - y_2|^p < (1-t) |x_1 - y_1|^p + t |x_2 - y_2|^p.$$

In a similar fashion, we obtain

$$|x_2 - y_1|^p < t |x_1 - y_1|^p + (1-t) |x_2 - y_2|^p.$$

Adding the previous two inequalities we obtain (11). Note however that (11) contradicts
the optimality of $\pi^*$, because it shows that $\pi^*$ is not cyclically monotone, which essentially
means that it is possible to rearrange the way mass is transported from P to Q by $\pi^*$ in
order to reduce the transportation cost (it would be cheaper to send mass from $x_1$ to $y_2$
and from $x_2$ to $y_1$ than to send mass from $x_1$ to $y_1$ and from $x_2$ to $y_2$). Therefore, we
conclude that if $(x_1, y_1) \in \mathrm{supp}(\pi^*)$, $(x_2, y_2) \in \mathrm{supp}(\pi^*)$ and $x_1 < x_2$, then $y_1 \le y_2$.
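
The monotonicity property just established says that an optimal plan never crosses. For two empirical measures with the same number of atoms and uniform weights, an optimal plan can be taken to be a one-to-one assignment (a standard fact, assumed here), so the property suggests that matching sorted values is optimal. The following minimal sketch, with hypothetical function names, compares the sorted matching with a brute-force search over all assignments; it is only practical for very small samples.

```python
from itertools import permutations


def wp_p_sorted(xs, ys, p):
    """p-th power of W_p between equal-size empirical measures,
    computed by matching the sorted samples (the quantile formula)."""
    n = len(xs)
    return sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n


def wp_p_bruteforce(xs, ys, p):
    """Minimum average cost over all one-to-one assignments.

    Cost grows like n!, so this is a check for tiny samples only.
    """
    n = len(xs)
    return min(
        sum(abs(x - ys[j]) ** p for x, j in zip(xs, perm)) / n
        for perm in permutations(range(n))
    )


if __name__ == "__main__":
    xs = [0.3, -1.2, 2.5, 0.9]
    ys = [1.1, 0.0, -0.7, 3.0]
    for p in (1, 2, 3):
        # The two values should agree up to floating-point rounding.
        print(p, wp_p_sorted(xs, ys, p), wp_p_bruteforce(xs, ys, p))
```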

Now, for $x \in \mathrm{supp}(P)$ and $y \in \mathrm{supp}(Q)$ we claim that $(x,y) \in \mathrm{supp}(\pi^*)$ if and only
if $F(x) = G(y)$. To see this, note that from the monotonicity property just established
we deduce that $(x,y) \in \mathrm{supp}(\pi^*)$ if and only if
$\pi^*(\mathbb{R} \times (-\infty, y]) = \pi^*((-\infty, x] \times (-\infty, y]) = \pi^*((-\infty, x] \times \mathbb{R})$.
In turn, the fact that $\pi^* \in \Gamma(P,Q)$ implies that $\pi^*((-\infty, x] \times \mathbb{R}) = F(x)$
and $\pi^*(\mathbb{R} \times (-\infty, y]) = G(y)$. From the previous relation we conclude that

$$\int_{\mathbb{R} \times \mathbb{R}} |x - y|^p \, d\pi^*(x,y) = \int_{\mathrm{supp}(\pi^*)} |x - y|^p \, d\pi^*(x,y) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|^p \, dt,$$

as we wanted to show.

B Proof of Lemma 1


Proof: We denote by $Y_{(1)} < \dots < Y_{(m)}$ the order statistics associated to the Ys. For
$k = 1, \dots, m-1$ and $t \in (0,1)$, we have $G_m(t) = k/m$ if and only if $t \in [Y_{(k)}, Y_{(k+1)})$,
which holds if and only if $t \in [F^{-1}(U^Y_{(k)}), F^{-1}(U^Y_{(k+1)}))$, which in turn is equivalent to
$F(t) \in [U^Y_{(k)}, U^Y_{(k+1)})$. Thus, $G_m(t) = k/m$ if and only if $G_m^U(F(t)) = k/m$. From the previous
observations we conclude that $G_m = G_m^U \circ F$. Finally, since $X_k = F^{-1}(U_k^X)$ we conclude
that

$$G_m(X_k) = G_m^U \circ F \circ F^{-1}(U_k^X) = G_m^U(U_k^X).$$